Hypothesis Testing

by Jonathan Ferrari

This notebook will go over the general steps of a hypothesis test and work through a few examples. For questions, concerns, or anything else relevant to this notebook, please contact me.

Setup¶

Imports¶

Here we will import some libraries needed for this demonstration. Make sure you run this cell!

In [1]:
import numpy as np
from datascience import *
from ipywidgets import *
from utils import *
import warnings
warnings.filterwarnings("ignore")

General Procedure for Hypothesis Tests¶

Here is the general procedure for completing a hypothesis test:

Observation ¶

  1. We notice something weird about a procedure or problem.
    1. Most of the time, this is a value we see that looks unusual to us.

Reason ¶

  1. We need to come up with ideas about why this is happening.
    1. One possibility is that this weird thing happened purely by chance: even though the result looks unusual, it could plausibly have occurred through randomness alone.
      1. This is the Null Hypothesis.
        1. It is the hypothesis we will simulate under.
    2. Another possibility is that the difference is not due to chance; in fact, something other than randomness is causing the value to be weird.
      1. This is the Alternative Hypothesis.
        1. It is the hypothesis we do not simulate under.

Weirdness ¶

  1. We need a way to test our ideas.
    1. We need a *test statistic* that will quantify how "weird" an event is.
      1. There are many to choose from including: absolute difference, TVD, proportions, etc.
        1. Different test statistics will be better for different situations. It is your job to pick one that will work well.
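To make the options above concrete, here is a sketch of a few test statistics written as plain Python functions for a two-category count (like left/right hands). The function names are illustrative, not part of this notebook's code:

```python
# Illustrative test statistics for a two-category count.
# Each returns a single number; larger means "weirder" under the null.

def abs_difference(count_a, count_b):
    # Absolute difference between the two counts.
    return abs(count_a - count_b)

def tvd_two_categories(count_a, count_b):
    # For two categories, the Total Variation Distance reduces to
    # half the absolute difference of the counts.
    return abs(count_a - count_b) / 2

def proportion_a(count_a, count_b):
    # Proportion of observations falling in category A.
    return count_a / (count_a + count_b)
```

Any of these could work; the key is that the statistic separates "ordinary" simulated events from "weird" ones in your particular problem.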

How Weird ¶

  1. We need to quantify how weird our original event from the Observation step is.
    1. We use our test statistic from the Weirdness step to calculate a number.
      1. This number is called our observed value.

Simulation ¶

  1. We need to know how often events this weird happen by chance.
    1. We first define a function.
      1. This function simulates the event happening under the Null Hypothesis.
        1. Then, we calculate our test statistic on this random event.
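A minimal sketch of such a function, assuming a 50/50 null hypothesis over two outcomes and using plain NumPy rather than the notebook's `datascience` helpers (the function name is illustrative):

```python
import numpy as np

def simulate_one_statistic(n=40):
    # One event under the null hypothesis: each of n students is
    # equally likely to raise a left or right hand.
    choices = np.random.choice(["left", "right"], n)
    left = np.count_nonzero(choices == "left")
    right = n - left
    # Return the test statistic (here the TVD) for this random event.
    return abs(left - right) / 2
```

Calling it repeatedly gives a different random statistic each time, which is exactly what the Repetition step collects.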

Repetition ¶

  1. We want to simulate an event many times.
    1. We want to keep track of our test statistic for each randomly simulated event.
      1. We can do this by creating an empty array initially.
        1. We then append each test statistic value to the array in a for loop.
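The empty-array-and-append pattern described above looks like this in plain NumPy (the stand-in simulation function below is illustrative, simulating the TVD of 40 fair left/right choices):

```python
import numpy as np

def one_statistic():
    # Stand-in simulation: TVD of 40 fair left/right choices.
    left = np.count_nonzero(np.random.choice(["left", "right"], 40) == "left")
    return abs(left - (40 - left)) / 2

results = np.array([])            # start with an empty array
for _ in np.arange(10_000):       # simulate the event many times
    results = np.append(results, one_statistic())

print(len(results))  # -> 10000
```

Note that appending to an array in a loop is slow for very large simulations (a list comprehension or preallocated array is faster), but the loop mirrors the pattern used throughout this notebook.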

Comparison ¶

  1. We need to compare our observed value against the randomly generated statistic values.
    1. To do this, we calculate the proportion of random statistic values that were at least as weird as our observed value.
      1. We can count the number of random test statistics that were at least as weird, then divide by the total count of random test statistics.
        1. This proportion is called our p-value.
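With the simulated statistics in a NumPy array and larger values counting as weirder, the p-value computation is one line. A sketch with toy numbers:

```python
import numpy as np

simulated = np.array([1.0, 3.0, 7.0, 2.0, 9.0, 0.0, 7.0, 4.0])  # toy statistics
observed = 7.0

# Proportion of simulated statistics at least as extreme as the observed value.
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
print(p_value)  # -> 0.375
```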

Conclusion ¶

  1. We need to decide if our observed value (the weird observation) is weird enough.
    1. To do this we compare with a cutoff.
      1. If $\text{p-value} \leq \text{cutoff}$ we reject the Null Hypothesis.
      2. Otherwise, we fail to reject the Null Hypothesis.
        1. Note: We cannot accept the Alternative, as we cannot prove that it is true.
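The decision rule above can be sketched as a small function (the name `decide` is illustrative, not used elsewhere in this notebook):

```python
def decide(p_value, cutoff=0.05):
    # Reject the null when the p-value is at or below the cutoff;
    # otherwise we fail to reject (we never "accept" the alternative).
    if p_value <= cutoff:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.04))  # -> reject the null hypothesis
print(decide(0.20))  # -> fail to reject the null hypothesis
```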

Examples¶

Now that we have a framework on how to proceed, let's look at a few examples!

If you click on the links in each of the sub-sections below, they will bring you to the respective explanation above.

Example 1: Hand Raising¶

Observation¶

Jonathan is a TA for Data 8, and notices that his students raise their left hands more often than their right hands.

He has $40$ students, and by his count, $27$ raise their left hand and $13$ raise their right hand.

Let's also define some variables to track the numbers in this observed event.

In [2]:
student_count = 40
observed_left = 27
observed_right = 13

Reason¶

Jonathan comes up with two ideas on what could be happening:

  • Idea 1: Null Hypothesis
    • Students raise their left and right hands the same amount; any observed difference is due to chance.
  • Idea 2: Alternative Hypothesis
    • Students do not raise their left and right hands equally; the observed difference is due to something other than chance.

Weirdness¶

He now must define a test statistic. He decides to use the Total Variation Distance (TVD), which will be computed as follows:

$$\frac{|\text{count of left hands} - \text{count of right hands}|}{2}$$

He could have chosen something else such as:

$$ |\text{count of left hands} - 20|$$

Let's define a function called tvd to help us compute this!

In [3]:
def tvd(left, right):
    return abs(left - right)/2

How Weird¶

Now let's calculate our observed value using the tvd function.

In [4]:
observed = tvd(observed_left, observed_right)
show(f"The observed value is: {observed}")

The observed value is: 7.0

Simulation¶

Now we need to define a function called simulate which will simulate one event under the null hypothesis, and give us our test statistic for that event.

This function randomly chooses "left" or "right" 40 times. It then returns the left-hand and right-hand counts it picked, left first.

In [5]:
def simulate():
    hands = make_array("left", "right")
    choices = np.random.choice(hands, student_count)
    left_count = sum(choices == "left")
    right_count = sum(choices == "right")
    return make_array(left_count, right_count)

Run the cell below a few times to check out what kind of values we are getting as we simulate under the null hypothesis.

In [6]:
simulate()
Out[6]:
array([22, 18])

Repetition¶

Now, under the null hypothesis, let's simulate $10,000$ random events and collect the test statistic values for each one in an array called results.

In [7]:
results = make_array()
trials = 10_000
for i in np.arange(trials):
    one_trial = simulate()
    one_stat = tvd(one_trial.item(0), one_trial.item(1))
    results = np.append(results, one_stat)
show(f"Collected all {trials:,} statistics")

Collected all 10,000 statistics

Great, now that we have all the statistics, let's compare these values with our observed value.

As an intermediate step, we'll create a table with the data in it.

Check out the histogram below

Note: Don't worry about understanding the code below, just focus on the graph!

In [8]:
hands = Table().with_column("Hand TVD", results)
bin_range = np.arange(max(results) + 2)
fig = hands.ihist(show=False, unit="hand", bins = bin_range, title = "TVD of Hands")
show("You can hover over each bar to find the exact percent!")
fig

You can hover over each bar to find the exact percent!

Comparison¶

Now, let's compute the p-value of the data! We can do this using the following formula:

$$\frac{\text{number of simulations at least as extreme as the observed value}}{\text{number of simulations}}$$

We'll compute our p-value below (in the first line), and plot a histogram showing it!

In [9]:
p_value = sum(results >= observed)/len(results)
fig = hands.ihist("Hand TVD", bins = bin_range, show = False, 
                  title = f"P-Value: {p_value} | Observed Value: {observed}")
fig = add_observed(fig, observed)
fig

Conclusion¶

Now that we've calculated the p-value, we can come to a conclusion. We need to set a "cutoff" value. This is the value we use to determine if we can reject the null hypothesis. Convention says we should use $0.05$, which is what we will use, but this number can be different in certain cases.

In the cell below, we will run some code to determine if we can reject the null hypothesis!

Try changing the cutoff value and see if the cell outputs something different!

In [10]:
cutoff = 0.05
if p_value < cutoff:
    show(f"We can reject the null hypothesis, because our p-value, {p_value}, is less than the cutoff, {cutoff}")
else:
    show(f"We cannot reject the null hypothesis, because our p-value, {p_value}, is not less than the cutoff, {cutoff}")

We can reject the null hypothesis, because our p-value, 0.0404, is less than the cutoff, 0.05

Example 1: Addendum¶

One thing to note is that our test statistic could have been quite different, and we may still have gotten the same answer.

Run the cell below, try selecting a different test statistic, and use the slider to change the p-value cutoff to see what result you get!

Note: The cell will take a few moments to render, so be patient with it!

Note: Don't worry about understanding the code below, just focus on the result!

In [11]:
@interact(statistic_function = Dropdown(options = pairs, description = "Function:"),
          cutoff = FloatSlider(min = 0, max = 0.5, value = 0.05, step = 0.05, description = "Cutoff:"))
def simulate_with_different_statistics(statistic_function, cutoff):
    observed = statistic_function(observed_left, observed_right)
    results = make_array()
    trials = 10_000
    for i in np.arange(trials):
        one_trial = simulate()
        one_stat = statistic_function(one_trial.item(0), one_trial.item(1))
        results = np.append(results, one_stat)
    show(f"Collected all {trials:,} statistics")
    hands = Table().with_columns("Hand Statistic", results)
    p_value = sum(results >= observed)/len(results)
    if p_value < cutoff:
        show(f"We can reject the null hypothesis, because our p-value, {p_value:.3f}, is less than the cutoff, {cutoff}")
    else:
        show(f"We cannot reject the null hypothesis, because our p-value, {p_value:.3f}, is not less than the cutoff, {cutoff}")
    if all_ints(results):
        bin_range = np.arange(min(results), max(results) + 1)
    else:
        bin_range = None
    fig = hands.ihist("Hand Statistic", show = False, bins = bin_range,
                      title = f"P-Value: {p_value:.3f} | Observed Value: {observed:.3f}")
    fig = add_observed(fig, observed, f"Observed Value: {observed:.3f}")
    return fig

Note: This visualization is not available in static HTML. For the dynamic Datahub notebook see

One important takeaway is that many of those test statistics had almost exactly the same p-value! This helps illustrate that as long as our test statistic makes sense and we are consistent, many different statistics can work. Importantly, however, for two of the options the p-values were very different. This shows how a poorly chosen statistic can lead us to an incorrect conclusion, so we need to be careful.

Example 2: Population and Politics¶

Observation¶

We have some data relating to U.S. counties in two tables, population and political. Each table has two columns; the first column of each is a unique code for the county. The second column in population is the population of that county, and the second column in political is the winner of the 2020 Presidential election in that county (e.g., "Trump" or "Biden").

Let's use .join to get the data into one table!

In [12]:
population = Table.read_table('county-population.csv')
population.show(5)
fips population
1001 55347
1003 203709
1005 26489
1007 22583
1009 57673

... (3229 rows omitted)

In [13]:
political = Table.read_table('county-politics.csv')
political.show(5)
fips winner
1001 Trump
1003 Trump
1005 Trump
1007 Trump
1009 Trump

... (3147 rows omitted)

In [14]:
counties = political.join("fips", population, "fips")
counties.show(5)
fips winner population
1001 Trump 55347
1003 Trump 203709
1005 Trump 26489
1007 Trump 22583
1009 Trump 57673

... (3106 rows omitted)

We would like to do a little bit of analysis of this data; specifically, we want to compare the average population of counties that voted for Trump with that of counties that voted for Biden.

To do this we will use .group.

In [15]:
pop_by_party = counties.group("winner", np.average).drop("fips average")
pop_by_party
Out[15]:
winner population average
Biden 361077
Trump 49249.2

We can see there is a very large difference here: on average, counties that voted for Biden had $7.33$ times the population of those that voted for Trump.

This will be our "weird" observation for this example. We should record these numbers to calculate our observed value.

In [16]:
observed_dem = pop_by_party.column(1).item(0)
observed_rep = pop_by_party.column(1).item(1)

Reason¶

We can come up with two ideas on what could be happening:

  • Idea 1: Null Hypothesis
    • The counties that vote democratic and those that vote republican have similar population sizes; any difference is due to chance.
  • Idea 2: Alternative Hypothesis
    • Counties that vote democratic have larger populations than those that vote republican.

Weirdness¶

Again, let's define a test statistic. Here we want to use a one-sided statistic because our alternative hypothesis specifies that we think democratic counties have more people on average than republican counties.

We could pick a few different statistics, but the simplest is the difference between the democratic average population, and the republican average population, formulated as follows:

$$\text{average democratic county population} - \text{average republican county population}$$

We could also use the ratio of the democratic average population to the republican average population:

$$\frac{\text{average democratic county population}}{\text{average republican county population}}$$
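For reference, that ratio statistic could be sketched like this (`pop_ratio` is an illustrative name; it is not used later in the notebook):

```python
def pop_ratio(dem, rep):
    # Ratio of average democratic-county population to average
    # republican-county population; values well above 1 point
    # in the direction of the alternative hypothesis.
    return dem / rep

print(pop_ratio(361077, 49249.2))  # roughly 7.33, the observed ratio above
```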

In this case, we will opt for simplicity and choose the difference (the first statistic).

Let's define a function called pop_diff to help us compute this!

In [17]:
def pop_diff(dem, rep):
    return dem - rep

How Weird¶

Now let's calculate our observed value using the pop_diff function.

In [18]:
observed = pop_diff(observed_dem, observed_rep)
show(f"The observed value is: {observed:,.2f}")

The observed value is: 311,828.21

Simulation¶

Now we need to define a function called simulate which will simulate one event under the null hypothesis and give us our test statistic for that event. In this case, since we have data with labels, we will do an A/B test.

This function shuffles the labels of the counties table and returns the democratic and republican average populations under the null hypothesis.

In [19]:
def simulate():
    shuffled_labels = counties.sample(with_replacement = False).column(1)
    values = counties.column(2)
    new_counties = Table().with_columns("population", values, "winner", shuffled_labels)
    grouped = new_counties.group("winner", np.average)
    dem = grouped.column(1).item(0)
    rep = grouped.column(1).item(1)
    return make_array(dem, rep)
simulate()
Out[19]:
array([ 79077.12662942, 108081.39432789])

Then, to calculate a simulated test statistic, we simply apply the pop_diff function!

Run the cell below a few times to check out what kind of values we are getting as we simulate under the null hypothesis.

In [20]:
sim_value = simulate()
pop_diff(sim_value.item(0), sim_value.item(1))
Out[20]:
-12430.205759789562

Repetition¶

Now, under the null hypothesis, let's simulate $1,000$ random events and collect the test statistic values for each one in an array called results.

Fair warning, this cell takes a little while to run!

In [21]:
results = make_array()
trials = 1_000
for i in np.arange(trials):
    one_trial = simulate()
    one_stat = pop_diff(one_trial.item(0), one_trial.item(1))
    results = np.append(results, one_stat)
show(f"Collected all {trials:,} statistics")

Collected all 1,000 statistics

Great, now that we have all the statistics, let's compare these values with our observed value.

As an intermediate step, we'll create a table with the data in it.

Check out the histogram below

Note: Don't worry about understanding the code below, just focus on the graph!

In [22]:
votes = Table().with_column("Population Differences", results)
fig = votes.ihist(show=False, unit="person", title = "Difference of Average Population")
fig

Comparison¶

Now, let's compute the p-value of the data! We can do this using the following formula:

$$\frac{\text{number of simulations at least as extreme as the observed value}}{\text{number of simulations}}$$

We'll compute our p-value below (in the first line), and plot a histogram showing it!

In [23]:
p_value = sum(results >= observed)/len(results)
fig = votes.ihist("Population Differences", show = False, 
                  title = f"P-Value: {p_value} | Observed Value: {observed:,.2f}")
fig = add_observed(fig, observed, f"Observed Value:{observed:,.2f}")
fig

Conclusion¶

Note: Here our p-value is $0.0$ because, out of all of our trials, none had a value at least as extreme as the observed value.

Hence, no matter what we set our cutoff to, we will be able to reject the null hypothesis.

Conclusion and Feedback¶

Now you've gone through the general steps for hypothesis testing and two examples. Hopefully you had half as much fun running through the notebook as I had making it!

For further clarification on hypothesis testing, please see Foundations of Data Science – Chapter 11: Testing Hypotheses

I would also appreciate any feedback you have on this lesson. You can access the feedback form by running the cell below and clicking the button that appears! Thanks :)

In [24]:
feedback_button()